Final Project DSC 540

Vismay Patel

EXPLORATORY DATA ANALYSIS AND IMPLEMENTATION OF DIFFERENT TECHNIQUES ON A BANKRUPTCY DATASET

Loading the dataset from a CSV file

No missing values in the dataset
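
A check like the one behind this statement can be sketched as follows. The small DataFrame and its column names (`ratio_a`, `ratio_b`) are hypothetical stand-ins, since the real data comes from the CSV file; only the `isnull().sum()` pattern is the point.

```python
import pandas as pd

# Hypothetical stand-in for the bankruptcy data (the real frame is read from CSV).
df = pd.DataFrame({
    "Bankrupt": [0, 0, 1, 0],
    "ratio_a": [0.37, 0.46, 0.43, 0.40],
    "ratio_b": [0.42, 0.54, 0.50, 0.45],
})

# Count missing values per column, then sum them for a single overall figure.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero means that no imputation or row dropping is needed before modeling.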


Changing column names

Below are the new column names that will be assigned to the dataset

Next steps:

Check if data is balanced or not.

Correlation between the variables.

Visualization to understand the nature of variables with the y attribute and relation with each other.

Feature selection: PCA and chi-square method.

Modeling a decision tree classifier.

Modeling an SVM with linear and RBF kernels.

From the above plot we can see that the data is imbalanced. We will move forward with exploratory analysis to get more insight into this.
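
The imbalance visible in the plot can also be confirmed numerically. This is a minimal sketch with made-up labels (95 non-bankrupt, 5 bankrupt); the proportions are an assumption for illustration and are not the real dataset's counts.

```python
import pandas as pd

# Hypothetical labels: 1 = bankrupt, 0 = not bankrupt (toy proportions only).
y = pd.Series([0] * 95 + [1] * 5, name="Bankrupt")

counts = y.value_counts()                 # absolute class counts
shares = y.value_counts(normalize=True)   # class proportions
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print("Majority-to-minority ratio:", imbalance_ratio)
```

A large majority-to-minority ratio is a sign that plain accuracy will be a misleading metric and that some form of rebalancing is worth considering.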

Correlation plot using heat map
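
The matrix behind such a heat map can be sketched as below. The three features `f1`, `f2`, `f3` are synthetic and hypothetical; the 0.9 threshold is also an assumption. The resulting `corr` frame is what would typically be handed to a plotting function such as seaborn's `heatmap`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in features: f2 is nearly collinear with f1, f3 is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "f1": x,
    "f2": 2 * x + rng.normal(scale=0.05, size=200),
    "f3": rng.normal(size=200),
})

corr = df.corr()  # Pearson correlation by default

# Keep only the upper triangle so that each feature pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pairs whose absolute correlation exceeds the (assumed) threshold.
strong_pairs = pairs[pairs.abs() > 0.9]
print(strong_pairs)
```

Listing the strongly correlated pairs alongside the plot makes it easier to decide which redundant features could be dropped.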


Generating histograms

Columns 2-11 histogram (note: column 1 is the dependent variable Y, Bankrupt)


Columns 2-11 distribution plot (note: column 1 is the dependent variable Y, Bankrupt)


Columns 2-11 pairplot

Columns 2-11 correlation plot using heatmap


Columns 12-21 histogram


Columns 12-21 Distribution plot with KDE curve



Column 12-21 Correlation plot using heat map


Column 22-31 histogram


Column 22-31 Distribution plot

Column 21-31 Correlation matrix using heat map


Columns 32-41 histogram


Column 32-41 distribution plot


Columns 32-41 Correlation plot using heatmap


Column 42-51 histogram


Column 42-51 distribution plot


Column 42-51 Correlation plot using heat map


Column 52-61 histogram


Column 52-61 distribution plot


Column 52-61 correlation plot using heat map


Column 62-71 histogram


Column 62-71 distribution plot

Columns 61 - 71 correlation plot using heat map


Column 72-81 histogram


Column 72-81 distribution plot


Column 71-81 correlation plot using heat map


Column 82-91 histogram


Column 82-91 distribution plot


Column 81-91 correlation plot using heatmap


Column 92-96 histogram


Column 92-96 distribution plot


Column 91-96 correlation plot using heat map


After looking at all the visualizations and correlation plots, we can conclude that each variable is either extremely correlated with some of the other variables or shows no correlation with any of them.

Because the classes are imbalanced, we need to use SMOTE (Synthetic Minority Over-sampling Technique) to balance the data.

SMOTE is applied to the training data only, so we first split the data into training and test sets. We then fit the models and carry out the further analysis on the SMOTE-balanced training data, keeping the test set untouched.
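
The split-then-oversample order can be sketched as follows. This is a simplified illustration of the SMOTE idea (interpolating between a minority point and one of its nearest minority neighbours), not the notebook's original code; in practice the `SMOTE` class from the imbalanced-learn package is the usual choice. The function name `smote_sketch`, the toy data, and all parameter values are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours and drop the first column, which is normally the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)             # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))                               # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy imbalanced data: 180 majority rows, 20 minority rows (hypothetical).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = np.array([0] * 180 + [1] * 20)
X[y == 1] += 2.0

# Split first, so that the test set contains only real observations.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Oversample the minority class of the training set until the classes match.
X_min = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_syn = smote_sketch(X_min, n_new)

X_train_bal = np.vstack([X_train, X_syn])
y_train_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])
print(np.bincount(y_train_bal))
```

Oversampling before the split would let synthetic points derived from test rows leak into training, which inflates the test scores.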


Linear and multivariate discriminant analysis

Here I am converting all the negative values into positive values, because chi-square feature selection accepts only non-negative inputs.
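
One way to meet the non-negativity requirement is sketched below. It rescales each feature to [0, 1] with `MinMaxScaler`, which is an assumption and may differ from the conversion used in the original notebook; the synthetic data and the choice of `k=2` are also illustrative only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with negative values; feature 0 is made informative about y.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 6))
X[:, 0] += 3 * y

# Rescale every feature to [0, 1] so that chi2 receives non-negative inputs.
scaler = MinMaxScaler()
X_pos = scaler.fit_transform(X)

# Keep the k features with the highest chi-square scores (k is an assumed value).
selector = SelectKBest(score_func=chi2, k=2)
X_sel = selector.fit_transform(X_pos, y)
selected = selector.get_support(indices=True)

print("Selected feature indices:", selected)
```

With a real train/test split, the scaler and the selector would be fitted on the training data and then applied to the test data with `transform`.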


Tuning LDA hyperparameters
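
A possible tuning setup is sketched here with a grid search over the solver and shrinkage settings of scikit-learn's `LinearDiscriminantAnalysis`. The grid values, the F1 scoring choice, and the synthetic data are assumptions, not the settings of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the SMOTE-processed training set.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

# Shrinkage is only supported by the 'lsqr' and 'eigen' solvers, not by 'svd',
# so the grid is given as two separate dictionaries.
param_grid = [
    {"solver": ["svd"]},
    {"solver": ["lsqr"], "shrinkage": [None, "auto", 0.1, 0.5, 0.9]},
]

search = GridSearchCV(LinearDiscriminantAnalysis(), param_grid, cv=5, scoring="f1")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

F1 is used for scoring here because accuracy tends to look good on imbalanced data even when the minority class is poorly predicted.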



Decision Tree classifier

Feature selection

Using cross-validation to select features

Evaluating the best features on the test set
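
The two steps above can be sketched with recursive feature elimination under cross-validation (`RFECV`). Whether the original notebook used this selector is not known, so treat it as one reasonable option; the synthetic data and all settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 12 features, of which only a few are informative.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Cross-validated feature elimination, fitted on the training data only.
selector = RFECV(
    DecisionTreeClassifier(random_state=0), step=1, cv=5, scoring="accuracy"
)
selector.fit(X_train, y_train)

# Refit a tree on the selected features and score it on the held-out test set.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
tree = DecisionTreeClassifier(random_state=0).fit(X_train_sel, y_train)
test_accuracy = tree.score(X_test_sel, y_test)

print("Number of selected features:", selector.n_features_)
print("Test accuracy:", round(test_accuracy, 3))
```

Keeping the test set out of the selection step means the final score reflects features chosen without any knowledge of the test rows.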

We will now change the criterion parameter to see how the model performs with different settings

Evaluating the accuracy based on the depth of the tree
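
A sweep over the split criterion and the tree depth might look like the sketch below. The depth range of 1 to 10 and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit one tree per (criterion, depth) pair and record its test accuracy.
results = {}
for criterion in ("gini", "entropy"):
    for depth in range(1, 11):
        tree = DecisionTreeClassifier(
            criterion=criterion, max_depth=depth, random_state=0
        )
        tree.fit(X_train, y_train)
        results[(criterion, depth)] = accuracy_score(y_test, tree.predict(X_test))

best = max(results, key=results.get)
print("Best (criterion, depth):", best, "accuracy:", round(results[best], 3))
```

Plotting accuracy against depth for each criterion usually shows where extra depth stops helping and the tree starts to overfit.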

Implementing SVM

NO PCA
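
A comparison of the linear and RBF kernels on the full feature set (no PCA step) can be sketched as follows. The standardization step, `C=1.0`, the 5-fold cross-validation, and the synthetic data are assumptions rather than the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; all features are used, with no PCA applied.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# SVMs are sensitive to feature scale, so each model standardizes its inputs first.
scores = {}
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    scores[kernel] = cross_val_score(model, X, y, cv=5).mean()

for kernel, score in scores.items():
    print(kernel, "kernel mean CV accuracy:", round(score, 3))
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the validation fold reaches the model.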